\documentclass[10pt, a4paper, twocolumn]{article} % 10pt font size (11 and 12 also possible), A4 paper (letterpaper for US letter) and two column layout (remove for one column)

\input{structure.tex} % Specifies the document structure and loads requires packages

\newcommand{\s}[1]{\textcolor{Blue}{#1}}
\newcommand{\p}[1]{\textcolor{Red}{#1}}
\newcommand{\w}[1]{\textcolor{Green}{#1}}

\title{Succinct Data Structures and Delta Encoding for Modern Databases}

\author{
  \authorstyle{Matthijs van Otterdijk\textsuperscript{1,2} and Gavin Mendel-Gleason\textsuperscript{1,3} and Kevin Feeney\textsuperscript{1,4}}
  \newline\newline
  \textsuperscript{1}\institution{TerminusDB {\protect\url{http://terminusdb.com}}}\\
  \textsuperscript{2}\institution{\protect\url{matthijs@datachemist.com}}\\
  \textsuperscript{3}\institution{\protect\url{gavin@datachemist.com}}\\
  \textsuperscript{4}\institution{\protect\url{kevin@datachemist.com}}
}

\date{\today}

\begin{document}

\maketitle

\thispagestyle{firstpage}

\lettrineabstract{Modern hardware architectures include larger main
  memory and pervasive parallelism. Modern software development
  processes now incorporate continuous integration/continuous delivery
  (CI/CD) coupled with revision control. These fundamental changes to
  information technology infrastructure necessitate a re-appraisal of
  database architecture.  TerminusDB makes a radical departure from
  historical architectures to address these changes. First, we
  implement a graph database with a strong schema so as to retain both
  simplicity and generality of design. Second, we implement this graph
  using succinct immutable data structures which enable more sparing
  use of main memory resources. Prudent use of memory reduces cache
  contention while read-only data structures simplify parallel access
  significantly.  Third, we adopted the delta encoding approach to
  updates as is used in source control systems such as git. This
  provides transaction processing and updates to our immutable
  database data structures, recovering standard database management
  features while also providing the whole suite of revision control
  features: branch, merge, squash, rollback, blame, and time-travel
  facilitating CI/CD approaches on data.}

\section{Introduction}

There has been an explosion of new database designs, including graph
databases, key-value stores, document databases, and multi-model
databases. Yet the majority of production databases are still based on
the RDBMS designs of the 1970s\autocite{Codd:1970:RMD:362384.362685}. This
has become a bottleneck in an increasingly automated modern technology
operations environment.

Meanwhile, both hardware infrastructure and software design process
have moved on significantly over the course of the last 40 years. In
particular, machines with terrabytes of RAM are now available for
prices reasonable enough for some small and medium sized enterprises.

At the same time, flexible revision control systems have
revolutionised the software development process. The maintenance of
histories, of records of modification and the ability to roll back
enables engineers to have confidence in making modificiations
collaboratively. This is augmented with important features such as
branching, labelling, rebasing, and cloning. When combined with
continuous integration/continuous
delivery\autocite{65147}\autocite{LAUKKANEN201755} (CI/CD) teams of
programmers can have confidence that central repositories are
maintained in correct states such that they can be safely deployed
once testing and verification have been passed.

These two developments suggest a solution at their
intersection. Namely, the use of in-memory immutable {\em succinct}
data structures and {\em deltas} as used in revision control
systems. TerminusDB demonstrates how these features can be combined to
produce a flexible transactional graph database.

\section{Design}

TerminusDB is a full featured graph database management system (GDBMS)
with a rich query language: WOQL (the Web Object Query Language).
However, we restrict our attention here to the underlying data
structure design and layout which we have implemented in a
Rust\autocite{Blandy:2015:RPL:3019371} library which we call
{\em{terminus-store}}.

We describe in turn the graph database model which is used, the
succinct data structure approach, and finally how we implement
revision control type operations using {\em deltas} which we collect
together with some metadata into objects which we term {\em layers}.

\subsection{Graph Databases}

Graph databases are one of the fastest growing new database
paradigms. Since graphs are very general it is possible to render many
database modeling techniques in a graph database. The simplicity and
generality of graphs make it a good candidate for a {\em
  general-purpose} delta encoded approach to an online transaction
processing database.

The TerminusDB infrastructure is based on the {\em RDF} standard. This
standard specifies finite labelled directed graphs which are
parameteric in some universe of datatypes. The names for nodes and
labels are drawn from a set of IRIs (Internationalized Resource
Identifiers). For TerminusDB we have chosen the {\em XSD} datatypes as
our universe of concrete values.

More formally, in TerminusDB a graph \(G\) is a set of triples drawn
from the set \( IRI \times IRI \times (IRI \oplus XSD)\) where \(IRI\)
is a set of valid IRIs and \(XSD\) is the set of valid XSD values.
While some RDF databases allow multiplicity of triples (i.e. a bag),
the choice of a set simplifies transaction processing in our setting.

For schema design, TerminusDB uses the OWL language with two
modifications to make it suitable as a schema language. Namely, we
dispense with the open world interpretation and insist on the unique
name assumption\autocite{DBLP:journals/semweb/FeeneyMB18}. This provides
us with a rich modelling language which can provide constraints on the
allowable shapes in the graph.

TerminusDB, following on from the RDF tradition, is not a property
graph. However, OWL extends RDF graphs with powerful abstractions such
as classes, restrictions, and strongly typed properties. We can choose
to interpret objects as either nodes or relationships as we please. In
a logical sense, property graphs are equivalent to a single view of a
more expressive OWL graph. This choice leads to a simplification of
the underlying representation, which, as we will see, is important
when constructing succinct data structures with change sets.

\subsection{Succinct Data Structures}

Succinct data structures\autocite{Jacobson:1988:SSD:915547} are a family of
data structures which are close in size to the information theoretic
minimum representation. Technically, they can be defined as data structures
whose size is:

\[ n + o(n) \]

Where \(n\) is the information theoretic minimum size. Succinct
representations are generally somewhat more computationally expensive
than less compact representations with pointers when working with
small datasets. However, as the size of the datastructure grows, the
ability to avoid new cache reads at various levels of the memory
hierarchy (including reading information from disk) means that these
representations can prove very speed
competitive\autocite{doi:10.1002/spe.2198} in practice.

TerminusDB largely borrows its graph data structure design from
HDT\autocite{10.1007/978-3-642-30284-8_36} with some modifications which
simplify the use of change sets. The authors originally evaluated HDT
as a possibility for a graph which was too large to fit in memory when
loaded into postgresql and found that queries on the resulting graph
performed much better in HDT\autocite{gleason_feeney2018}.

In particular, the primary datastructures of the HDT format are
retained, namely {\em front coded dictionaries}, {\em bit sequences}
and {\em wavelet trees}.

\subsubsection{Plain Front-Coding Dictionary}

Due to the unusual quantity of shared prefixes found in RDF data due
to the nature of URIs and IRIs, front-coding provides a fast
dictionary format with significant compression of
data\autocite{MARTINEZPRIETO201673}.

The primary operations exposed by the data structure are {\em
  string-id} which gives us a natural number corresponding with the
string, and {\em id-string}, which gives a string corresponding with a
natural number.

\begin{table}
	\centering
	\begin{tabular}{l|rl}
		\toprule
		String & Offset & Remainder \\
		\midrule
        Pearl Jam & 0 & Pearl Jam \\
        Pink Floyd & 1 & ink Floyd \\
        Pixies & 2 & xies \\
		The Beatles & 0 & The Beatles \\
		The Who & 4 & Who \\
		\bottomrule
	\end{tabular}
    \caption{Plain Front Coding Dictionary}
    \label{tab:pfc}
\end{table}

The data strucure sorts the strings and allows sharing of prefixes by
reference to the number of characters from the preceeding strings
which are shared. An example is given in Table~\ref{tab:pfc}. The
position in the dictionary gives us the implicit natural number
identifier.

\subsubsection{Succinct Graph Encoding}

Once subject, object, and property of an edge have been appropriately
converted into integers by use of the subject-object dictionary, the
value dictionary, and the predicate dictionary, we can use these
integers to encode the triples using bit sequences.

Succinct sequences encode sequences drawing from some alphabet
\(\sigma\). In the case of a bit-sequence, \(\sigma=\{0,1\}\). They
typically expose (at least) the following primitives:

\begin{itemize}
\item \(rank(a, S, i)\) which counts occurances of \(a\) in the sequence from \(S[0,i]\).
\item \(select(a, S, i)\) which returns the location of the \(i\)-th
  occurance of \(a\) in the sequence \(S\).
\item \(access(S, i)\) which returns the symbol at \(S[i]\).
\end{itemize}

Given a sorted set of triples for each subject identifier in order
from \(\{0..n\}\) where \(n\) is the number of triples, we emit a
\(1\) followed by a \(0\) for every predicate associated in a triple
with that subject. We then produce a vector of all predicates used and
the association with the subject is apparent from the position of
zeros in the bit sequence.

We repeat the process for predicates and objects resulting in a
complete encoding for our triples. We can see an example in
Table~\ref{tab:graph}.  We have written the vectors in this table so
that the triples are vertically aligned, with subjects in blue,
predicates in red, and objects in green in order to make the encoding
easier to see. The subject identifiers are actually implicit in the number of
\(1\)s encoded in the subject bit sequence and are only written in
the table for clarity.

\begin{table}
	\centering
	\begin{tabular}{l|l|l}
		\toprule
		Triples & Encoding & Description \\
		\midrule
        \(  (\s{1},\p{2},\w{3}) \)  & \(\s{1}\;\;\;\;\s{2} \;\;\;\;\s{3}\) & Subject Ids \\
        \(  (\s{1},\p{2},\w{4}) \)  & \(   1\;\;\;\;    1\;    0\;    1\) & Encoded Subject Bit Sequence\\
        \(  (\s{2},\p{3},\w{5}) \)  & \(\p{2}\;\;\;\;\p{3}\;\p{4}\;\p{5}\) & Predicate Vector \\
	    \(  (\s{2},\p{4},\w{6}) \)  & \(   1\;   0\;    1\;    1\;    1\) & Encoded Predicate Bit Sequence\\
	    \(  (\s{3},\p{5},\w{7}) \)  & \(\w{3}\;\w{4}\;\w{5}\;\w{6}\;\w{7}\) & Object Vector \\
		\bottomrule
	\end{tabular}
    \caption{Succinct Graph Representation}
    \label{tab:graph}
\end{table}

This format allows fast lookup of triples based on query modes in
which the subject identifier is known, as we can use \(select\) to
find the position in the predicate vector and subsequently use the
predicate identifier to \(select\) in the object vector. We use a
wavelet tree to enable search starting from the predicate. Details of
this can be found in \autocite{10.1007/978-3-642-30284-8_36}.

\subsection{Delta Encoded Databases}

The use of {\em delta encoding} in software development is now
ubiquitous due to the enormous popularity of the {\em git} revision
control system which makes use of these techniques to store histories
of revisions.

Git stores objects which contain either the complete dataset of
interest or the information about what is updated (deleted/added) as
a delta. All changes to the source control system are thereby simply
management problems of these objects.

This approach exposes a number of very powerful operations to sofware
developers collaborating on a code base. The operations include:

\begin{itemize}
\item {\bf History:} Since new updates are actually layered over previous
  ones, developers can {\em time travel}, looking into the past,
  rolling back to the past, or even reapplying changes to current
  versions.
\item {\bf Branching:} Developers can create a derived version of a given
  code-base where additional operations can be performed without
  disrupting the original.
\item {\bf Merging:} When two branches diverge, the changes can be merged
  into a single version by choosing a strategy for combining changes.
\end{itemize}

These features have powered a revolution in software engineering and
have elevated the importance of dev-ops automation in modern IT
infrastructures. It would be nice if we could apply them to databases
too and similarly elevate the field of data-ops. However, git itself
is not the solution – git is squarely focused on code management, and
data and code differ in some important fundamental characteristics.

Codebases can be adequately modelled as a hierarchy of directories
containing files, with changes modelled as the addition or subtraction
of lines of text to these files. Databases, by contrast, lack a
universal navigation and addressing mechanism like the filesystem.
They often have complex internal structures which govern the
granularity of updates. They cannot usefully be reduced to the same
universal unit of comparison as used by code: the line of text. Given
that databases can be many times larger than even the biggest
code-base, the fineness of the granularity of diffs is an important
performance factor.

TerminusDB uses nodes and edges, replete with classes and restrictions
to model the structure of data – enabling a fine-granularity in
expression of changes. Otherwise, it uses a similar approach to git for
expressing updates. A given database is comprised of layers which
stand in place of the objects of git. Each layer is identified by a
unique 20-byte name. The base layer contains a simple graph
represented using the succinct data structures described earlier.

\begin{figure}
	\includegraphics[width=\linewidth]{layers-diagram.png} % Figure image
	\caption{A graph composed of layers} % Figure caption
	\label{fig:layers} % Label for referencing with \ref{bear}
\end{figure}

Above this layer, we can have further layers. Each additional layer
above the base layer is comprised of additional dictionaries for newly
added subjects and objects, predicates, or values. It also contains the
index structures used for the base graph to represent {\em positive}
edges which have been added to the graph. And we have a membership set
of {\em negative} edges which describe those triples which have been
deleted as shown in Figure~\ref{fig:layers}.

Each layer has a pointer to the previous layer which is achieved by
referring to its 20-byte name.

This immutable chain structure allows for straightforward uncoordinated
multi-read access sometimes called multiversion concurrency control
(MVCC)\autocite{Mohan:1992:EFM:130283.130306}\autocite{Sadoghi:2014:RDL:2733004.2733006}. This
approach also makes branching simple to implement. Any number of new
layers can point to the same former parent layer.

In order to manage these layers as datastores, we use a {\em label}. A
label is a name which points to one of the 20-byte identifiers. In the
present implementation this is a file with the name of the label
containing the 20-byte identifier.

\subsubsection{Dictionary modifications}

Due to the use of delta encodings, new triples can be added which are
not present in the original dictionary. We therefore start new
dictionaries with a recorded offset, remembering the last bucket from
the previous dictionary.

\subsection{Write Transactions}

When an update transaction is initiated, a new {\em layer builder} is
created, which logs all newly inserted or deleted edges. When this
{\em layer builder} is committed, it yields a {\em layer} which has
organised the insertions in our succinct data structures.

In TerminusDB we require that graphs conform to the constraints
imposed by the OWL description of the dataset. This means that we
produce a hypothetical database by committing the layer builder
without advancing head. First, we check that the constraints hold on this
new intermediate database. After these are passed, it is safe to
advance head to this newly created layer. {\em Advancing} is done by
side-effecting the label to point to the new 20-byte value. The
problem of coordination in the face of side-effects is reduced to the
problem of label management, simplifying much of the architecture. A
schematic of the workflow of the write transaction is given in
Figure~\ref{workflow}.

Automated checking of data constraints is particularly important if we
are to confidently merge database branches that might have divergent
schemas or mutually inconsistent states (e.g. where we have a property
with a cardinality of exactly one in both branches, but with a
different value in each). At a minimum, we need to ensure that merging
branches does not result in the database entering an inconsistent
state and hence, although constraint checking is beyond the scope of
this paper, it is a critical piece of the puzzle in enabling automated
data-ops\autocite{gleason2018}.

\begin{figure}
	\includegraphics[width=\linewidth]{query_write_commit_head} % Figure image
	\caption{Write transaction workflow} % Figure caption
	\label{workflow} % Label for referencing with \ref{bear}
\end{figure}

% More here. 

\subsection{Delta compression}

As new updates are performed the database layer depth increases. This
will incur a performance penalty requiring repeated searching at each
layer for every query. In order to improve performance, we can perform
a {\em delta compression} similar to the mechanisms used in
git. Alternatively, we can recalculate the full dataset as a new
base-layer. In git, this {\em delta compression} step can be performed
manually, or it will occur when a depth threshold is passed.

Since the layers are immutable, this operation can be done
concurrently. Commits that occur before the process is completed simply
layer over the delta with no visible change in the content of
the database.

Compressed deltas of this type can allow older layers to be archived,
or even deleted. The removal of previous layers removes the capacity
to {\em time travel} or to track whether the database arose from a
branch. However, this information can be kept seperately in a metadata
repository allowing memory of the branching structure and other
information about previous commits, but not the capacity to
time-travel to them. We plan to implement this graph metadata
repository in future version of TerminusDB.

\section{Future Work}

Values are stored as strings using a plain front coding dictionary
uniformly for all data types. Obviously, this is less than ideal in
that it causes an expansion in size for the storage of integers,
dates, and other specific types. It also means that, only search from
the beginning of the datatype is optimised. In future versions of
Terminus-store, we hope to differentiate our indexing strategies for
the various datatypes in XSD.

For strings, the use of succinct data structure immediately suggests a
potential candidate: the
FM-index\autocite{Ferragina:2005:ICT:1082036.1082039}. With FM-indexing,
very large datasets could still have reasonable query times for
queries which are typically done on full text indexes using inverted
term-document indexing. We have yet to explore the candidates for
numeric and date types.

Currently the tracking of history and branches is implicit. We intend
to adopt a more explicit approach, storing a graph of the various
commits coupled with timestamps and other metadata which will
facilitate effective management.

\section{Conclusion}

The use of advanced CI/CD workflows for databases as yet has not been
practical due to the lack of tool-chain support. In the software world,
we have seen just what a large impact appropriate tools can make with
advent of git.

TerminusDB makes possible these collaborative CI/CD type operations in
the universe of data management.

This is made possible because of the synergies which an immutable
layered approach has with the {\em succinct data structure} approach
that we have used for encoding.

TerminusDB provides a practical tool for enabling branch, merge,
rollback, and the various automated and manual testing regimes which
they facilitate on a transactional database management system which
can provide sophisticated query support.

\printbibliography[title={Bibliography}]

\end{document}
